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ABSTRACT 

With the growth of large photometric surveys, accurately estimating photometric red- 
shifts, preferably as a probability density function (PDF), and fully understanding the 
implicit systematic uncertainties in this process has become increasingly important. In 
this paper, we present a new, publicly available, parallel, machine learning algorithm 
that generates photometric redshift PDFs by using prediction trees and random forest 
techniques, which we have named TPZ. This new algorithm incorporates measurement 
errors into the calculation while also dealing efficiently with missing values in the data. 
In addition, our implementation of this algorithm provides supplementary information 
regarding the data being analyzed, including unbiased estimates of the accuracy of the 
technique without resorting to a validation data set, identification of poor photomet- 
ric redshift areas within the parameter space occupied by the spectroscopic training 
data, a quantification of the relative importance of the variables used to construct the 
PDF, and a robust identification of outliers. This extra information can be used to 
optimally target new spectroscopic observations and to improve the overall efficacy of 
the redshift estimation. We have tested TPZ on galaxy samples drawn from the SDSS 
main galaxy sample and from the DEEP2 survey, obtaining excellent results in each 
case. We also have tested our implementation by participating in the PHAT1 project, 
which is a blind photometric redshift contest, finding that TPZ performs comparable 
to if not better than other empirical photometric redshift algorithms. Finally, we dis- 
cuss the various parameters that control the operation of TPZ, the specific limitations 
of this approach and an application of photometric redshift PDFs. 

Key words: galaxies: distance and redshift statistics - surveys - statistics - methods: 
data analysis - statistical 



1 INTRODUCTION 

Late time cosmological measurements are often made by 
carefully measuring the three-dimensional distribution of 
galaxies. For these measurements, the distance between the 
galaxy and the observer is most accurately made by using 
a spectroscopic redshift. However, spectroscopic measure- 
ments are considerably more difficult to obtain, and are, 
therefore, more expensive than photometric measurements, 
as they require long exposures in order to achieve sufficient 
signal-to-noise over a wide wavelength range. As an exam- 
ple, while the Sloan Digital Sky Survey (SDSS; |York et al.j 



2000 1 has taken millions of spectroscopic redshifts of galax- 



ies to high precision (Aihara et al.|[20TT|, the same survey 



has obtained detailed photometric measurements for a much 
larger sample of galaxies in considerably less time. This di- 



chotomy will only grow with ongoing and planned surveys 
that are dominated by photometric-only observations. 

As a result, considerable attention has been focused 
on the estimation of redshifts by applying statistical tech- 
niques to the photometric observations of sources through 
different filters. These photometric redshift (hereafter photo- 
z) estimation techniques have become crucial for modern, 
multi-band digital surveys; and this need for fast and accu- 
rate photo-z estimation is becoming even more important 
for large photometric surveys like the Dark Energy Survey 
(DES[J and the Large Synoptic Survey Telescope (LSSlQ, 
which are probing galaxies that are often too faint to be 
spectroscopically observed. Adopting a photo-z approach 
allows cosmological measurements on galaxy samples that 
are currently at least a hundred times larger than compa- 
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rable spectroscopic samples, that have relatively simple and 
uniform selection functions, and that extend to fainter flux 
limits and larger angular scales and thus probe much larger 
cosmic volumes. In summary, photo-z techniques provide a 
much higher number of galaxies with redshift estimates per 



unit telescope time than spectroscopic surveys ( Hildebrandt 
et al.|2010| |. 



The estimation of galaxy redshifts using multi band 
photometry was first performed by |Baum| ( |1962[ ), while |Koo| 
( 1985 1 and Loh & Spillar ( 1986 1 were the first to compute 



galaxy redshifts by using digital photometric observations 
from charge coupled devices. In the last fifteen years, how- 
ever, the estimation of redshifts from broadband photome- 
try has grown significantly. Presently, there are many dif- 
ferent methods for computing photometric redshifts (see, 
e.g., |Wang et aLl|2008| |fflldebrandt et aT^|20Tol [Abdalla| 



|et al.|20TT for an updated comparison of current photomet- 
ric redshift methods and public codes). These techniques can 
be broadly categorized as either template fitting algorithms 
or empirical training algorithms. The template fitting algo- 
rithms (e.g., |Bem'tez||2000| IBolzonella et al. 



et al. 



et al. 



1980 



2003 



sert et al.||2006| IFeldmann et a 



2000 



20101 can either use empirical (e.g., 



Csabai 



2006 Assef 



Coleman et al. 



Assef et al.|2010 I or synthetic spectral templates (e.g., 



Bruzual Sz Charlot|20 03). These techniques estimate a pho- 



tometric redshift by finding the best match between the ob- 
served magnitudes or colors and the synthetic magnitude or 
colors from the suite of templates that are sampled across 
the expected redshift range of the photometric observations. 

Empirical training methods use a spectroscopic train- 
ing data set to calibrate an algorithm that can be quickly 
applied to new photometric observations. Initially the train- 
ing set was used to map a polynomial function between the 
colors and the redshift (e.g., Connolly et al. 1995 Brun- 
|ner et aL||1997[ ). More recently, this process has been ex- 
tended to machine learning algorithms, including artificial 
neural networks (e.g., Collister & Lahav 2004; Oyaizu et al 



2008b[ ), boosted decision trees (e.g., |Gerdes et al.||2010[ ), 
random forest (e.g., Carliles et al. 2010|, nea rest neigh- 
bors (e.g., |Ball et al.||2007| |2008| |Lima et al.||2008[ ), self- 
organized maps (e.g., |Geach 2012||Way fc Klose|2012[), spec- 



tral connectivity analysis (e.g., |Freeman et al.|2009 l, Gaus- 



sian process (e.g., Way et al. 2009 Bonfield et al. 20101, 



support vector machines (e.g., Wadadckar 20051 or Quasi 



Newton Algorithm (e.g., Cavuoti et al.|20"l2 l. While only a 
few of these photo-z methods are publicly available, they all 
perform to a similar accuracy and provide only a single red- 
shift estimate rather than a full redshift probability density 
function for each galaxy. 

The template fitting methods, which leverage model 
galaxy spectral energy distributions (SED), have been used 
extensively and are often preferred since once implemented 
they can be readily applied to new data by simply adopting 
the appropriate photometric filter transmission functions. 
Given a representative sample of template galaxy spectra, 
most of these techniques can reliably predict a photo-z, al- 
though the use of training data that includes known redshifts 
can improve these predictions (e.g., [Bem'tcz 2 000| |Ilbert| 
et al.|20 06). These techniques, however, are not exempt from 



uncertainties due to measurement errors on the survey filter 
transmission curves, mismatches when fitting the observed 
magnitudes or colors to template SEDs, and color-redshift 



degeneracies. Furthermore, template techniques generally 
become less reliable at high redshift where the uncertain- 
ties in galaxy SEDs increases, since the templates are often 
calibrated using low redshift galaxies. 

On the other hand, when provided with a high qual- 
ity spectroscopic training sample, empirical training tech- 
niques have been shown to have similar or even better per- 
formance (Collister & Lahav 2004[ ). In addition, empirical 
techniques are generally simpler to apply to different data 
sets and frequently provide an improved quantification of 
any uncertainties, which can be encoded in a photo-z prob- 
ability density function (PDF). They also have the addi- 
tional advantage that is easier to include extra information, 
such as galaxy profiles, concentration, angular sizes, or en- 
vironmental properties, in addition to magnitudes or colors. 
These methods, however, are only reliable within the limits 
of the training data, and sufficient caution must be exercised 
when extrapolating these algorithms beyond the limits of the 
training data. 

As the demand for more accurate photo-z methods has 
grown, techniques have branched out into new areas in or- 
der to improve the accuracy of photo-z estimation. While 
a complete understanding of the systematic uncertainties is 
needed for a reliable and accurate machine learning photo-z 
algorithm (see, e.g., |Oyaizu et al.|[2~008a| for a discussion 
on photometric redshift errors), other issues have recently 
been recognized in the effort to generate the most accurate 
photometric redshifts. For example, |Cunha et al.| ( |2012a|b] ) 
analyzed the effect of systematics within the spectroscopic 
training data set that is used to estimate a galaxy photo-z. 
Likewise, other functionality that a modern photo-z algo- 
rithm should provide include an identification of outliers on 
the training set that lead to an incorrect estimation of a 
photo-z, an identification of the features within the train- 
ing data that most strongly affect a photo-z estimate, and 
an identification of areas of parameter space (e.g., magni- 
tudes, colors, and redshift ranges) that are under sampled 
by the training data. The last two features are important 
to the design of photometric surveys, as they provide use- 
ful information to optimally and efficiently guide follow-up 
spectroscopy to generate the scientifically most useful train- 
ing data set for these algorithms. 

Of course, we estimate a galaxy's redshift so that it can 
be used in a subsequent analysis. A number of cosmologi- 
cal measurements such as galaxy clustering, weak lensing, 
baryon acoustic oscillations and the mass function of galaxy 
clusters (see, e.g., Ho et al.|2012 Reid et al.|2010 Jee et al 



20131, among others, depend strongly on both the num- 



ber of targeted galaxies in the sample and the accuracy of 
the measured distances to the galaxies. Given the growth of 
photometric-only surveys, these cosmological measurements 
will require the use of reliable photometric redshifts and a 
complete understanding of their uncertainties. As a result, 
photo-z methods will be most effective going forward if they 
can not only robustly provide a reliable redshift estimation 
but also a redshift probability density function. 

In addition, the extra information in a redshift PDF 
can be used to improve or enhance a particular cosmologi- 



cal measurement analysis. For example, Myers et al. (20091 



have shown that by using the full redshift PDF within a 
two-point angular quasar correlation function, as opposed to 
simply using a single redshift estimate, their measurement 
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has been improved by a factor of nearly four, which is equiv- 
alent to increasing the survey volume by a similar factor. 



gression, have been shown to be one of the most accurate al- 



Likewise, Mandelbaum et al. ( 2008 1 discuss how the accu- 



racy of photo-z and the inclusion of the photo-z PDF affect 
the calibration for weak lensing studies. Other recent stud- 
ies (see, e.g., |Sheth||2007) |van Breukelen fc Clewley||2009 1 



have also demonstrated how a cosmological measurement 
can be improved by using a photo-z PDF. However, given 
the lack of reliable photo-z PDF estimation techniques, this 
remains an underutilized tool. 

In this work, we address these issues by introducing 
TPZ (Trees for Photo-Z), a new, Python-based, machine 
learning, parallel code for estimating photometric redshift 
PDFs by using prediction trees and random forest tech- 
niques (Breiman et al. 1984, Brciman 2001). Our approach 
is an ensemble learning method that generates several clas- 
sifiers and combines their results into a final output. Predic- 
tion trees partition the multi-dimensional space recursively 
into smaller regions, which is terminated when a leaf only 
contains a few elements. Within these final leaves, our algo- 
rithm can leverage a simple model for the actual prediction, 
by using, for example, the mean value for a regression or the 
mode in a voting process as used in a classification scheme. 

Likewise, the basic idea of a random forest method is to 
use bootstrap samples from the training data to build a set 
of prediction trees. These trees are constructed by selecting 
the best split point from a random subsample of the dimen- 
sions (e.g., magnitudes or colors) along which the data are 
subdivided. By aggregating the predictions from this for- 
est of trees, we produce a more accurate estimate. In our 
implementation, we incorporate the errors on the measured 
attributes by perturbing the galaxy parameters by their un- 
certainties. We repeat this process, generating multiple in- 
dividual new observations of each galaxy that are subse- 
quently combined into a final PDF, which can be used as 
desired to estimate a single redshift and its associated error. 
In addition, our implementation of this technique naturally 
incorporates data with missing values and also provides ex- 
tra meta information, such as an unbiased estimate of the 
prediction error, a measure of the relative importance of the 
parameters used in the photo-z estimation as a function of 
redshift, an identification of regions where the training data 
provide poor predictions, and an identification of galaxies 
that are likely outliers. 

This paper is organized as follows. In §2 we provide a 
complete and detailed description of the photo-z method 
presented herein. §3 introduces the different data sets we 
use to test the efficacy and accuracy of TPZ and its unique 
capabilities. In §4 we describe the specific experiments we 
perform to test our photo-z implementation by using these 
data, present an analysis of the results and discuss the ca- 
pabilities of our approach. Finally in §5, we conclude with 
a summary of our main results and a discussion of the TPZ 
algorithm. 



2 METHODS 

Among the different non-linear methods that are used to 
compute photometric redshifts, prediction trees are one of 
the simplest yet most accurate techniques. Supervised learn- 
ing methods using prediction trees, either classification or re- 



gorithms for low as well high multi-dimensional data ( Caru- 
ana et al. 2008[). They also are fast, can easily deal with miss- 



ing data, and have similarities with other non-parametric 
technique. For example, prediction trees are similar to k- 
nearest-neighbor (kNN) algorithms in that they both group 
data points with similar characteristics. 

However, kNN use test data to identify similar points 
within the training set while keeping the parameter k fixed, 
even though some points might have a very different num- 
ber of similar neighbors. On the other hand, prediction trees 
have terminal leaves that bound regions of the parameter 
space where the predictions (i.e., redshifts) and their prop- 
erties (e.g., magnitudes) are similar. As both the quantity 
and identify of test data can vary between leaf (or termi- 
nal) nodes, prediction trees are known as adaptive nearest- 
neighbor methods ( Breiman et al.|1984 l. 



2.1 Prediction trees 

Prediction trees are built by asking a sequence of questions 
that recursively split the data, frequently into two branches, 
until a terminal leaf is created that meets a stopping crite- 
rion (e.g., a minimum leaf size). The small region bounding 
the data in the terminal leaf node represents a specific sub- 
sample of the entire data with similar properties. Within this 
leaf, a model is applied that provides a fairly comprehensi- 
ble prediction, especially in situations where many variables 
may exist that interact in a nonlinear manner as is often the 
case with photo-z estimation. A visualization of an example 

tree generated by our technique is shown in Figure [Tj 

There are two classes of prediction trees ( Breiman et al 



19841: classification and regression, both of which are imple- 



mented in TPZ. 

(i) Classification Trees (also called Decision Trees): As the 
name suggests, this type of prediction tree is designed to 
classify or predict a discrete category from the data. Each 
terminal leaf contains data that belongs to one or more 
classes. The prediction can be either a point prediction based 
on the mode of the classes inside that leaf or distributional 
by assigning probabilities for each category based on their 
empirically estimated relative frequencies. For example, in 
our photo-z technique we use the magnitudes or colors of 
galaxies to determine the probability that a galaxy lies ei- 
ther inside or outside a specific redshift bin (a detailed ex- 



planation of the algorithm is presented in [ 2.4 1 



The tree is built by starting with a single node that encom- 
passes the entire data, and recursively splitting the data 
within a node into two or more branches along the dimen- 
sion that provides the most information about the desired 
classes. Formally this is done by choosing the attribute that 
maximizes the Information Gain (Ig), which is defined in 
terms of the impurity degree index Id'- 



I G (T,M)=I d (T) 



m e values(M) 



~\T\ 



■Id{T„, 



(1) 



where T is the training data in a given node, M is one of 
the possible dimensions (e.g., magnitudes) along which the 
node may be split, m are the possible values of a specific 
dimension M (in the case of magnitudes m might represent 
2 or more magnitude bins), |T| and \T m \ are respectively 
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Figure 1. A simplified example of a binary prediction tree plotted 
radially. The initial node is close to the center of the figure. The 
splitting process terminates when a stopping criterion is reached. 
Individual colors represent the unique variable (e.g., fixed aper- 
ture g or r or magnitude colors) used for the splitting at each 
node. Each leaf provides a specific prediction based on the infor- 
mation contained within that terminal node (gray triangles in the 
figure). The subpanel corresponds to zoomed in region from the 
tree. 

the size of the total training data and the number of objects 
for a given subset m within the current node, and Id is 
the function that represents the degree of impurity of the 
information. 

There are three standard methods to compute the impurity 
index (Id)- The first method is by using the information 
entropy, which is defined in the expected manner (similar to 
Thermodynamics) : 

n 

I d {T) = H{T) = -Y J filog 2 f i (2) 

where i is the class to be predicted (e.g., inside or outside a 
redshift bin) and the sum is over all n possible classes (two 
in our example), and fi is the fraction of the training data 
belonging to class i. The same definition applies for a subset 
of the data T m . 

The second option, is to measure the Gini impurity (G). In 
this case, a leaf is considered pure if all the data contained 
within it have the same class. The Gini impurity can be 
computed inside each node: 

n 

I d (T)^G(T) = J2J2f^ ( 3 ) 

where fi and fj are the fractions of the training data of class 
i or j . The same equation applies for a subset of T along one 
particular dimension M. Since fi are the fractions for all 
possible classes, we have that the £L /,; = 1, and, therefore, 
Yljjti /j ~ 1 ~ fi- As a result, the expression for Equation|3] 
can be simplified to 

n 

I d {T) = G{T) = l-Y J fi (4) 

i—l 




Fraction of Class i = l (/j) 



Figure 2. Impurity index I d for a two-class example as a function 
of the probability of one of the classes fi using the information 
entropy (blue), Gini impurity (green) and classification error (red). 
In all cases, the impurity is at its maximum when the fraction of 
data within a node with class 1 is 0.5, and zero when all data are 
in the same category. 



The third method is to simply measure the impurity degree 
by using the classification error (Ce)- 

I d (T) = C E (T) = l-max{/ s } (5) 

where the maximum values are taken among the fractions 
fi within the data T that have class i. During the tree con- 
struction, the data are scanned over each dimension to de- 
termine the split point that maximizes the information gain 
as defined by Equation [l] and the attribute that maximizes 
this impurity index overall is selected. For example, Fig- 
ure [2] shows these three impurity indices, for a node with 
data that are only categorized into two classes, as a func- 
tion of the fraction of the data having a specific class. If all 
of the data belong to a specific class, the impurity is zero. 
On the other hand, if half of the data have one class and the 
remaining data all belong to the other class, the impurity is 
at its maximum. Our implementation can calculate any of 
these three different impurity indices, and any one of them 
can be selected for the construction of the prediction trees. 
Alternatively, the index providing the highest information 
gain at a given node can be selected, 
(ii) Regression Trees A second type of prediction tree is used 
when the data to be predicted is continuous; since it does 
not use discrete classes, we instead fit a regression model to 
the data inside a leaf. The construction of a regression tree 
follows the same structure as the classification tree, and once 
again a node is generally divided into two branches (i.e., a 
binary tree). There are two primary differences, however, 
between regression and decision trees. First, each leaf has 
training data with different redshift values; the prediction 
value is based on a regression model covering these points. 
Usually, the mean of the training redshifts is returned, so 
each prediction is no longer a discrete classification, but is 
instead an estimation of a continuous variable. Second, the 
procedure used to select the best dimension to split for a 
regression tree is based on the minimization of the sum of 
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the squared errors, which for a node T is given by 



2.2.1 Ancillary information 



m e values(M) iem 



(6) 



where m are the possible values (bins) of the dimension M, 
Zi are the values of the target variable on each branch/bin 
m, and z m is the specific prediction model used. In the case 
of the arithmetic mean, we have that z m — — Y\ Zi, 
where n m are the members on branch m. This allows us to 
rewrite Equation [6] as 



S(T) = Yl 



(7) 



where Kn is the variance of the estimator z m . 
At each node in our tree, we scan all dimensions to identify 
the split point that minimizes S(T). The splitting dimension 
that has the lowest value of S is selected as the splitting 
direction, and this procedure is repeated until either some 
threshold in S is reached or any new nodes would contain 
less than the predefined minimum leaf size. 



2.2 Random forest 

Random forest is an ensemble learning algorithm that first 
generates many prediction trees and subsequently combines 
their predictions together. It is one of the most accurate em- 
pirically trained learning techniques for both low and high 
dimensional data ( Caruana et al.|2 008). The idea is simple, 
given a training sample T containing N objects that have 
M attributes (e.g., survey magnitudes), create Nt bootstrap 
samples of size N (i.e., N randomly selected objects with re- 
placement). From these samples, we create the correspond- 
ing Nt prediction trees without pruning them back. 

If all the variables are examined when deciding the best 
point to split, the method is called bagging (Brci rrian|1996 1. 
An additional layer of randomness can be added to the bag- 
ging process by choosing the best split point from among a 
random subsample of m* < M variables at each node, where 
m* is kept fixed during the process. The value of m* is an 
adjustable parameter that is directly related to the strength 
of a tree (a strong tree has a low error rate) and the corre- 
lation between any two trees (the more correlated the trees, 
the higher the forest error rate). Increasing or reducing m* 
has the same effect on both features. Of course we want to 
select the optimal value of m*. A good starting point is to 
set m* ~ vM, although the accuracy of the algorithm is, 
in the end, not very sensitive to this parameter for a large 
number of trees and relatively small number of dimensions. 
After constructing all of the prediction trees, a final and ro- 
bust prediction is calculated by combining all Nt estimates 
together. 



Breiman (20011 first introduced this algorithm and 



showed that this technique performs very well when com- 
pared to many other learning techniques. This technique is 
robust against overfitting (i.e., there is no limit on the num- 
ber of trees, Nt, in the forest), it runs efficiently on large 
data sets, it can generate an internal unbiased estimate of 
the error, and it can provide extra information about the 
relative importance of the input variables and the internal 
structure of the training data. 



Given a training set T, this extra, ancillary information can 
be calculated prior to the computation of the photo-z PDFs. 
As a result, we can use this a priori information to explore 
the efficacy of different parameter combinations while also 
obtaining an estimate of the bias and variance of the photo-z 
prediction. This is done by using out-of-bag (OOB) samples, 
which consist of a random sample of data that are left out 
of each tree. In the process of growing a forest, Nt trees 
are created using bootstrap samples of size N. In each of 
these samples, about one-third of the data are not used when 
constructing a tree, and are instead used as a test sample for 
the recently built tree. The test results created by using this 
OOB data are combined together to obtain estimators of the 
error, which, when built using a sufficiently large number of 
trees in the forest, has been shown to be unbiased and as 
accurate as using a validation set of the same size as the 



training set (Breiman 1996J) . This removes, therefore, the 
need for a separate validation sample that can introduce a 
bias into the final result. This method also has the advantage 
of using the full spectroscopic data to compute PDFs. 

The OOB data can also be used to estimate the rela- 
tive importance of each attribute or dimension to the photo- 
z calculation. This provides an elegant method to identify 
and remove attributes that do not contribute significantly, 
thereby reducing the noise and dimensions of the problem. 
This also has the benefits of increasing the performance of 
the implementation, improving our understanding of the 
complexity in the interaction between different attributes, 
and improving the identification of new training data from, 
for example, follow-up observations. This relative impor- 
tance is estimated for each attribute by first quantifying 
any variations in the prediction error when the OOB data 
are permuted only along the specific attribute, leaving the 
others unchanged. This process is repeated across all trees, 
and the end result is the average in the error increment when 
compared to the unperturbed variables for all the tress over 
the entire forest. 

Another item we can construct is the proximity matrix, 
Prox(i,j), which is a symmetric, positive definite matrix 
that gives the fraction of trees in the forest in which element 
i and j fall in the same terminal leaf. This matrix is con- 
structed tree-by-tree by running all the data, both the OOB 
and the data used for growing, down each tree. When galaxy 
i and j are in the same leaf, their proximity is increased by 
one. At the end, all the proximities are normalized by the 
total number of trees; therefore, similar galaxies will tend 
to have higher proximities than dissimilar ones. This matrix 
can be computed for the training set, the test set, or both 
together. Since this matrix quantifies the relative similarity 
between galaxies, it can be used to identify outliers within a 
data set. For example, by computing the squared sum of all 
proximities for each galaxy, we can algorithmically identify 
galaxies with few neighbors by selecting sources with the 
lowest value, which can be flagged for further inspection. 

To build or apply prediction trees, the data cannot have 
missing values for any of the attributes used to construct the 
trees (e.g., the most important survey magnitudes). To in- 
clude more data into the classification process, we can use 
the proximity matrix to estimate any missing values or to 
replace highly uncertain values. We do this in an iterative 
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process, by performing the forest growing step of the al- 
gorithm and replacing the missing attribute at each pass. 
We select the replacement value by computing the average 
parameter value from the k nearest galaxies; we can also 
inversely weight these galaxies by their respective distance. 
This process continues until we have obtained convergence 
or until a fixed number of iterations have been performed. 

By using the proximity matrix, OOB error estimates, 
and the relevant importance of different attributes, we can 
also identify zones where the photo-z prediction is either 
poor or is loosely constrained by the training data. In either 
case, this knowledge is of vital importance when deciding 
what galaxies to target spectroscopically in order to opti- 
mally improve a training sample. One way this feature is 
implemented is by using the two most important attributes 
to map the areas of parameter space by their prediction 
error. This map can guide the identification of new data 
that increases the efficacy of the training sample by target- 
ing those galaxies that minimize the prediction error in un- 
der sampled areas, thereby more effectively utilizing limited 
spectroscopic follow-up observations. 

2.3 Previous work 

Two previous works have utilized prediction trees for photo- 



z calculations. Carliles et al. (2008 20101 predicted photo- 



metric redshifts and their errors by using a random forest 
built with the regression tree package for R by using the 
mean value as their leaf model. They used a subset of the 



main galaxy sample of the SDSS Data Release 6 ( Adelman- 
|McCarthy et al.|2008[ ) catalog with colors as their attributes. 
They demonstrated that random forest methods are well 
suited to the photo-z estimation problem as they obtained 
comparable results to other machine learning methods, and 
they publicly released their R scripts. They did not, how- 
ever, take full advantage of the ancillary information pro- 
vided by the random forest technique, nor did they produce 
probability density functions. 



Gerdes et al. (20101 have developed a new technique, 



called ArborZ, to compute photometric redshifts using 
boosted decision trees (BDT). These classification trees are 
constructed in a similar manner to our classification trees, 
as discussed in §2.1| In their approach, all data points 
start with equal weights, but after each tree is built, higher 
weights are assigned to points that were previously misclas- 
sified. This process iteratively combines weak classifiers into 
a single stronger one ( Schapire et al.|1998 l; and, in the end, 
a weighted vote across the classifiers produces the final pre- 
diction. In their approach, they divide the redshift range 
into small bins and use an ensemble of BDTs to generate 
a probability distribution. A photometric redshift is esti- 
mated by determining the mean value of this distribution. 
They tested this algorithm on SDSS DR6 data as well as 
DES simulated data, finding similar performance to other 
empirical training methods, such as the photo-z estimates 



provided by Oyaizu et al. ( 2008b I in the case of the SDSS 
data, and by ANNz ( |Collister fc Lahav||2004[ ) for the DES. 

Our approach, detailed below, extends these previous 
results to create a new publicly available method that uses 
random forests to compute PDFs by using classification 
and/or regression trees. Our approach also uses extra in- 
formation encoded within the measurement errors, gener- 



ates extra, ancillary information describing the spectro- 
scopic training sample, and provides a better control of the 
uncertainties. We also, therefore, are able to examine the 
importance of the attributes used to grow the trees, and 
identify areas in the attribute space where the training data 
are dominated by shot noise statistics. 



2.4 TPZ Algorithm 

Our implementation of prediction trees with random forest 
for photometric redshift PDF prediction, TPZ, is written in 
the Pythorj^Jprogramming language and uses MPI for paral- 
lel communication to run efficiently on distributed memory 
systems. As shown in Figure [3] our implementation is di- 
vided into three steps: 

Data Pre-processing The first step prepares the data for 
the construction of the prediction trees. First, we optionally 
perform a principal component analysis (PCA) of the data in 
order to reduce strong correlations between attributes. This 
PCA transformation can reduce the dimensionality of the 
input data prior to the training, which can be important for 
large data sets with many attributes. This step also includes 
the replacement of missing values, which we do iteratively, 
finding that between 5-10 iterations leads to a convergence 
on the missing values. We next generate Nr training samples 
by perturbing the measured values according to the error on 
each variable, which we assume to be normally distributed. 
In this manner, we can incorporate the measurement error 
in the prediction tree construction, we reduce the bias on 
proximity matrices, and we introduce randomness into the 
construction of the trees in a controlled manner. 
Random Forest Construction The second step is the 
actual construction of the random forest, where we gener- 
ate fully grown prediction trees. We construct Nt trees by 
using bootstrapping for each perturbed sample in the set of 
Nr training samples we created in the first step. This step 
can be done several times with a smaller number of trees 
to both explore the parameter space and gain insight into 
the internal structure of the data prior to building the final 
prediction trees. Finally, this step can also produce the an- 
cillary information that can characterize the performance of 
our code prior to estimating the final photo-z values. 
Photo-z PDF Construction The final step uses the 
newly generated prediction trees to create individual photo- 
z PDFs for each source in the application data set. This 
process involves running each source down each tree, test- 
ing the source at each node until we arrive at a terminal leaf 
where we make a prediction. At the end, we combine all of 
the forest predictions into a probability density function. 



2.4-1 Implementation modes 

TPZ can use either type of prediction tree that uses random 
forests: classification or regression; the actual implementa- 
tion details only differ after the first step. 

Classification Mode: In this mode, the spectroscopic 
sample is divided into several redshift bins that either have 
a fixed width (or, alternatively, resolution), which allows a 



http:/ /www. python. org/ 
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Figure 3. A simplified representation of TPZ, the details for each subprocess are described more fully within the text. Note that each 
tree drawn in the figure represents a full random forest with Nx bootstrap samples for every one of the Nr random perturbed samples. 
The big circles containing galaxies represent a terminal leaf, which are directly used to make a prediction for each new galaxy. 



variable number of galaxies within each redshift bin, or have 
a fixed number of galaxies per redshift bin, which means 
our redshift bins are of variable width. Within each bin, we 
create a forest of classification trees, as described above, us- 
ing the perturbed samples as well as the bootstrap samples. 
These trees classify an object as either lying inside or out- 
side a bin. By using all of the training data within each bin, 
we both decrease the overall performance of our implemen- 
tation due to the larger data volume and also increase the 
chance of catastrophic errors since most data will lie outside 
the bin of interest. 

We address these issues by following a similar approach to 



that used by [Gerdes et al. ( 2010p . For each bin, we identify 



all sources that lie inside the bin. This number of galaxies 
with class inside is rii n . We next select a factor fn ou t of 
the nin galaxies that have spectroscopic redshifts that lie 
outside the bin by a factor of z out times the width 8 z of 
the bin. This means that galaxies with class outside fall 
Zout x 8z from the boundaries of the bin. This allows a better 
distinction between the class inside and the class outside as 
it would have if we include objects located very near to these 
boundaries. In the end, each bin will have (1 + fn ou t)riin 
galaxies available for training the forest. 
If the training set is limited, wider bins can be used in or- 
der to have a sufficient number of training galaxies per bin. 
Furthermore, these bins can even be allowed to overlap by 
some value; this overlap can be taken into account when 
building the photo- z PDFs by normalizing by the fraction 
of wider bins that overlap with each other. After all of the 
forests are created for all of the bins, the test data are run 
down each tree in each forest, which assigns either the class 



inside or outside to the test source. After combining all of 
the assigned classes from the forest, we assign a probability 
for the source to belong to that redshift bin, which is simply 
the number of times the source was assigned the inside class 
divided by the total number of trees. By repeating this pro- 
cess for each bin and renormalizing the subsequent result, 
we generate a photo-z PDF for the source. 
Regression Mode: In this mode, we use all available 
training data to fully grow each tree. For each perturbed 
sample, Nt trees are created using the methodology ex- 



plained in [2.1 ii) At the end, there is one large random 
forest covering the entire spectroscopic range. The differ- 
ence with the classification mode is that, after the tree has 
been constructed by splitting the nodes according to Equa- 
tion [7] each terminal leaf only ends up with a few sources to 
make the prediction. In the simple case of obtaining a single 
estimate, this leaf can be replaced by the mean or the me- 
dian of the values inside it; more generally, these values are 
kept for computing the PDF. To compute a photo-z, the test 
data are run down each tree in the forest. Each tree returns 
the set of spectroscopic redshift measurements that, after 
conversion to a given resolution, are converted into a PDF 
by normalizing to the total number of objects returned. All 
trees have the same weight when constructing the PDF, as 
well as the values of the terminal leaves identified in each 
tree. If a single value is desired, a mean value and its error 
can be returned via the standard methods by aggregating 
all of the relevant values as returned by the different trees. 

The choice of either of these modes will depend on the 
characteristics of the data being analyzed. On average, the 
regression mode runs faster than the classification mode for 
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a specific accuracy, and is also better suited for data that are 
not uniformly distributed. The classification mode, on the 
other hand, provides a better characterization of the data 
as a function of redshift, since it creates its own random 
forest on each bin unlike the regression mode where a forest 
is created using the full range in redshift. The classification 
mode is also better suited for uniformly distributed data 
and can provide a reliable and robust prior probabilities in 
a Bayesian framework when using wider redshift bins. When 
faced with a high quality and rich training set, both modes 
will provide similar accuracies and error rates, but the re- 
gression mode, being faster, would generally be preferred. 

Figure [3] shows a simplified workflow of our TPZ im- 
plementation. Each tree in this figure represents an entire 
forest, where the single tree results are averaged to get a 
final prediction. The classification mode predicts a prob- 
ability that a source lies within each bin, thereby build- 
ing up a photo-z PDF, while the regression mode keeps all 
sources found on a terminal leaf and combines their values 
to construct a photo-z PDF at the desired resolution. For 
both modes, ancillary information can be provided, and both 
modes share the same data pre-processing steps. 



3 DATA SELECTION 

To demonstrate the capabilities and the efficacy of the dif- 
ferent parameter configurations of the TPZ code, we have 
used several different photometric and spectroscopic data 
sets that vary both in quantity and quality. In this section 
we briefly discuss these data sets and the specific data sam- 
ples from each that we used in the testing process, which is 
further described in Q 



3.1 Sloan Digital Sky Survey 



The Sloan Digital Sky Survey (SDSS; [York et al.|2000| ) phase 
I and phase II conducted a photometric survey in the op- 
tical bands u, g, r, i, z that covered almost 10,000 square 
degrees, or approximately one-fourth of the entire sky. The 
resultant photometric catalog contains photometry of over 
10 s galaxies, making the SDSS one of the largest surveys 
ever performed. The SDSS also conducted a spectroscopic 
survey of targets selected from the SDSS photometric cata- 
log, obtaining spectra of about 10 6 low redshift galaxies. 

In this paper, we use a subset of the Main Galaxy Sam- 
ple (MGS; Strauss et al.|2002l from the Data Release 7 cat- 



alog ( Abazajia n et al.|2009[ |. Specifically, we selected 55,000 
galaxies by using the online CasJobs websit^] This spec- 
troscopic data ranges from z « 0.02 up to z w 0.3 with a 
mean redshift of 0.1. From this sample, we randomly se- 
lected 15,000 galaxies to train the TPZ implementation, 
while holding the remaining 40,000 for testing. We note that 
this is a blind test, as the testing data are not used in any 
way to train or calibrate the TPZ algorithm. Of all the mea- 
sured attributes in the SDSS photometric catalog, we have 
used only the four dimensions corresponding to the galaxy 
colors as derived by the extinction corrected model magni- 
tudes : u — g, g — r, r — i, and i — z. We use the SDSS 



colors as opposed to the more commonly used magnitudes 
for this particular test to both demonstrate the flexibility of 
TPZ and to generate scientifically more interesting ancillary 
information. 



3.2 PHoto-z Accuracy Testing Project 



The PHoto-z Accuracy Testing (PHAT; |Hildebrandt et al.| 



20101 project first compared the performance and system- 



atics of different photo-z codes on synthetic data (PHAT0) 
that was specifically created for a contest, and also more re- 
cently used real data (PHAT1) in a similar manner; thereby 
providing a more realistic comparison by using real mea- 
surements. The PHAT pro jecl[^] provides filter responses for 
photo-z estimation by SED-fitting methods and a training 
data set for photo-2 estimation by empirical methods. The 
true redshifts of the test data are not public, which pro- 
vides a more reliable, blind comparison between different 



approaches (see Hildebrandt et al. 2010 for more details 
about the contest). In this paper, we use the PHAT1 data, 
which consists of real observations selected from the Great 
Observatories Origins Deep Survey Northern field (GOODS- 
N; |Giavalisco et al.|2004| . 

These data include photometry from the original ACS 
four-band data: F435W(B), F606W(V+R), F775W(i') and 
F850LP(z') that have been cross-matched with photometry 
from|Capak et al.| ([20041), including U (from KPNO), Bj, 



Vj,R c ,Ic,z (from SUBARU), and HK (from QUIRC). 
In addition, the photometry of PHAT1 also includes Deep J 
and H bands (from ULBCA M; |Wang et al.|2006[ ), K s (from 
WIRC; |Bundy et aL]|200^ , and four Spitzer IRAC bands: 
3.6, 4.5, 5.8, and 8.0u.m. This photometric catalog was 
cross-matched with all available spectroscopic GOODS-N 



data (|Cowie et al.|2004"||Wirth et al.|2004||Treu et al.|2005 



Reddy et al. 2006L producing a final data set of eighteen 



band photometry and spectroscopy for 1,984 galaxies. 

For the contest, only 515 galaxy redshifts were pub- 
lished for use as training data; the remaining redshifts were 
unpublished and used internally by the PHAT project to 
conduct a blind comparison test. Despite the limited train- 
ing data, multiple authors submitted the photo-z predictions 



and the results were published in Hildebrandt et al. (20101 



As the contest had already been completed when we started 
this work, we were unable to participate. However, as dis- 
cussed in [J4] we have tested TPZ on the PHAT1 training 
data in an analogous manner as the contest and have sub- 
mitted our results to the official PHAT wiki. 



3.3 Deep Extragalactic Evolutionary Probe 

The Deep Extragalactic Evolutionary Probe (DEEP) is a 
multi-phase, deep spectroscopic survey performed with the 
Keck telescope. Phase I used the LIRS instrument (Oke 



et al. 



1995[), while pha se II used the DEIMOS spectro- 
graph ( |Faberet al.|2003| ). The DEEP2 Galaxy Redshift Sur- 
vey is a magnitude limited spectroscopic survey of objects 



with Rab < 24.1 ( |Davis et aL]|2003| |Newman et al.||2012 | 
The survey includes photometry in three bands from the 
CFHT 12K: B, R, and I and it has been recently extended 



4 http:/ /casjobs. sdss.org/Cas Jobs/ 



5 www.astro.caltech.edu/twiki_phat/bin/view/Main/WebHome 
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Figure 4. Photometric vs. spectroscopic redshift for all test SDSS MGS test galaxies using regression mode (Left) and classification 
mode (Center). (Right) A comparison of the bias (upper panel) and the scatter (lower panel) as a function of redshift for the SDSS MGS 
data by using the regression mode (blue dots) and the classification mode (green squares). 



by cross-matching the data to other photometry databases. 



In this work, we use the Data Release 4 iMatthews et al. 



|2013j), the latest DEEP2 release that includes secure and 
accurate spectroscopy for over 38,000 sources. The photom- 
etry for the sources in this catalog was expanded by us- 
ing two u, g, r, i, and z surveys: the Canada-France-Hawaii 
Legacy Survey (CFHTLS [Gw^n][20T2| , and the SDSS. For 
additional details about the photometric extension of the 



Table 1. A comparison between the Regression Mode and the 
Classification mode for the SDSS MGS galaxies with different 
confidence level restrictions. 



DEEP 2 catalog see Matthews et al. (20131 



To use the DEEP2 data with TPZ, we selected sources 
with secure redshifts (ZQUALITY^ 3) that were securely 
classified as galaxies, have no bad flags, and have full pho- 
tometry. Even though the filter responses are similar, the 
u, g, r, i, and z photometry come from two different sur- 
veys and are thus not identical. We therefore treat those 
galaxies with SDSS photometry for fields 2,3, and 4 of the 
DEEP2 target areas independently from those for field 1 
with CFHTLS photometry. In the end, this leaves us with 
a total of 19,699 galaxies with eight band photometry and 
redshifts, of which we use 10,000 for training and hold the 
rest for testing. 



4 APPLICATION/DISCUSSION 

In this section, we apply the photo- z estimation technique 



presented in [2A to the SDSS main galaxy sample (MGS), 
the PHAT1 blind test sample, and the DEEP2 sample, 
which were all introduced in Sj3] Since the point of this paper 
is to introduce the TPZ algorithm and our associated imple- 
mentation, we use these three different data sets to highlight 
different features of the code. Thus we do not apply TPZ 
uniformly to each data set, and the three subsections herein 
are necessarily different. 



4.1 SDSS Main Galaxy Sample 

We first apply TPZ to the SDSS main galaxy sample, us- 
ing both the regression and the classification methods as 
explained in §2.4. 1| and we present the results in Figure [4] 
The left and center panels compare the estimated photo- 
metric redshifts to the spectroscopic redshifts for all 40,000 
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galaxies held out for testing from the SDSS MGS, for re- 
gression and classification modes, respectively. Both imple- 
mentations show similar performance in the central part of 
the redshift distribution; however, there are differences at 
both the low and high redshift regions of this sample. The 
right panel shows both the mean of the bias, defined as 
Az = z S poc — Zzphot, and its scatter for eight redshift bins. 
The regression mode performs slightly better at all redshift 
bins, but especially on the first and last bin, where the clas- 
sification mode shows systematic errors in classification. 

This error arises due to the lack of training data at those 
redshifts for the classification mode, where, though we allow 
some overlap between bins, we keep the bin size constant, 
which can result in large differences in the number of train- 
ing objects per bin. This reduction is most pronounced in 
the lowest and highest redshift bins, which results in a lower 
accuracy and a higher scatter. We also are affected at the 
low redshift regime by the fact that a predicted redshift can 
not be negative, those introducing a positive skew to the 
predicted redshift values for very low redshifts. 

Since both implementation modes produce photo-z 
PDFs, we can compute confidence levels, zConf, around 
the mean (or mode) for each individual PDF. To simplify 



10 M. Carrasco Kind and R. J. Brunner 




0.00 0.05 0.10 0.15 0.20 0.25 0.00 0.05 0.10 0.15 0.20 0.25 
redshift redshift 



Figure 5. Four example PDFs produced by TPZ for the SDSS 
MGS selected with different values of zConf . The higher the value 
of zConf, the more narrowly concentrated the PDF is about the 
mean. The vertical dashed line corresponds to the spectroscopic 
value for the test galaxy and the gray area encloses the confidence 
level. 



comparisons with past results, we define zConf as the inte- 
grated probability between z p hot±£>"TPz(l + Zphot)- We select 
o~tpz ~ 0.03 as an approximation to the intrinsic scatter 
of the algorithm when applied to the data, which can be 
computed by using the OOB data. Of course we could de- 
fine zConf in some other manner, but the results would be 
relatively unaffected. Figure [5] presents four different PDFs 
taken from the SDSS MGS, each with different confidence 
levels that are shown as a bounded gray area under each 
PDF curve. 

In this example, we measured zConf around the mean 
of each PDF and the actual spectroscopic redshifts are 
shown as vertical dashed lines for reference. From this fig- 
ure, we see that zConf provides a reasonable summary of 
the concentration of the PDF, and can, therefore, be used 
to further restrict a photo- z sample by selecting only those 
PDFs with a zConf value above some threshold. In gen- 
eral, as shown in this Figure, we see that lower confidence 
values are strongly correlated with less accurate predictions. 
Nevertheless, it is still possible to have a small fraction of 
galaxies with high zConf PDFs that are estimated at the 
wrong redshift. We discuss the zConf parameter and its use 
in identifying a clean galaxy sample in further detail in §4.2| 

In Table [l] we present the mean value of the different 
performance metrics described in the previous paragraphs, 
as applied to the SDSS MGS, as well as the fraction of 
remain galaxies that remain in the sample after a cut on 
zConf. As before, we see that, on average, the regression 
mode outperforms the classification mode on this data set, 
although the difference is reduced when we apply a cut on 
the confidence level. Interestingly, at more restrictive zConf 
cuts, the performance of both modes is similar; however, the 
number of galaxies remaining in the regression mode sample 
is higher. Note that since these are averaged values over the 



sample, any minor change implies a significant change on 
individual calculations. 

As a result, we believe that making a cut on zConf 
results in a cleaner sample, as shown by the improved per- 
formance metrics for either implementation mode. The dif- 
ference in the fraction of galaxies that remain in each sample 
indicates that, on average, PDFs generated by the classifi- 
cation implementation are broader than PDFs generated by 
the regression implementation. This result is reasonable, as 
the classification mode bins the redshift space and provides 
probabilities for all bins which can produce a more sparse 
distribution. In the classification mode the probabilities are 
computed individually for each redshift bin, which could be 
important and easily extended to build a prior distribution 
that can be used in a Bayesian method. Since the regression 
mode was shown to be more accurate for the SDSS (see, 
e.g., figure [I] and Table [l), we use the mean of the PDF as 
calculated by the regression mode on the SDSS MGS data 
in the rest of this section, unless otherwise indicated. 

We can broadly compare our use of zConf to define 
clean galaxy samples to other published results; we note 
that a direct, one-to-one, comparison is problematic due to 
the different training sets and attributes used in computing 
photometric redshifts for the SDSS main galaxy sample. If 
we take a zConf >— 0.75, we keep 91% of the data and 
compute the fraction of galaxies with \Az\ < Zi, where = 
0.001, 0.002 and 0.003 as 45.2%, 73.0% and 89.8%, respec- 
tively. These valued compare favorably to those from |Lau-| 
rino et al. (20111 who, even though they used an extended 
catalog, compute these same values to be 43.4 %, 72.4% 
and 86.9%, with a mean bias of < Az >= 15 x 10 -3 and 
(jaz = 1-52 x 10 -2 (these latter values can be compared 
with our results shown in Table [TJ . Finally, we note that 
making a strict cut of Az > 0.006 identifies an outlier frac- 
tion of 1.54%, while other groups, using extended catalogs 
as well, have reported values of 1. 9% (Ge rdes et al.|[2010[ ) 
and 2.6% ( |Oyaizu et al.|2008b |. 



4-1.1 Ancillary information 



As detailed in §2.2. 1| we can use the out-of-bag data to 
compute extra, ancillary information about the SDSS MGS 
dataset. For this purpose, we first select approximately one- 
third of the objects from each bootstrap sample. Using these 
data, we compute an unbiased indicator of the bias (i.e., 
Az) and its standard deviation (i.e., ctaz) for each tree. Fi- 
nally, we average these metrics over all trees. In Figure [6] 
we present in the top panel the mean bias as a function of 
redshift taken both from the test data (blue line) and from 
the OOB data used during the training process (green line). 
The bottom panel in this figure presents the standard devia- 
tion for each redshift bin. The RMS of these values provides 
an approximation to the intrinsic error and scatter of the 
TPZ code, which can be used to compute confidence levels. 
From the OOB data, we compute the RMS of the bias to 
be 0.0064, which can be compared to the value of 0.0017 
obtained directly from TPZ for the SDSS MGS test sample. 
Likewise, we can approximate the scatter; for the OOB data 
we have 0.0235, while for the SDSS MGS test sample we have 
0.0203. Thus, the OOB data provide upper limits for these 
metrics calculated by using only the training sample. 

This OOB technique is unique due to the fact that the 
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Figure 6. {Top) The averaged Az as a function of redshift for 
all test galaxies from the SDSS MGS (blue circles) and from the 
OOB (Out-Of-Bag) data computed individually for each tree and 
subsequently averaged over the forest (green squares). (Bottom) 
The standard deviation of Az as a function of redshift for the test 
set (blue circles) and the OOB data (green squares). In this case 
the OOB data provide a unbiased, upper-limit for these metrics. 



OOB data were not used to train a particular tree, yet the 
full data are used when building the forest by using the boot- 
strap samples. If we would have run all of the training data 
after the forest was constructed without using the OOB ap- 
proach, we would have obtained biased (although lower val- 
ues) for these metrics. This approach would thus not pro- 
vide a prior estimation of the accuracy of TPZ. With the 
OOB data, we compute a priori these unbiased estimates 
exclusively from the training set, without the need for a val- 
idation set, allowing us to take full advantage of all available 
spectroscopic data. 

The OOB data can also be used to compute the relative 
importance of each attribute, which can be done by permut- 
ing each of the attributes in the non OOB data when training 
the tree. The result of this process can be directly compared 
with the unperturbed case using the OOB data, as shown 
in Figure [7] In this figure, the left panel shows the relative 
importance factor, which is computed by using the absolute 
value of the OOB bias as a comparison metric, of the four 
colors used to build the regression trees for the MGS sam- 
ple. In this plot, a factor of one implies that the attribute 
acts as a random variable, since a perturbation along that 
direction produces no changes. Any value greater than one 
produces a change in the bias, making it larger and therefore 
less accurate. 

From this figure, we see that the g — r color shows 
the largest relative importance factor, being close to four, 
meaning that the absolute bias, on average, changes by this 
same factor when this color is randomly perturbed. On the 
other hand, the i — z color is the least, on average, rele- 
vant attribute in this context, with a relative importance 
factor less than 1.5. Due to the limited number of attributes 
in this test, however, removing this last color actually pro- 
duces slightly worse results. In the general case when more 



attributes are present, removing less important variables will 
improve the results. While this result might seem counter- 
intuitive, it results naturally from the random nature of the 
tree construction. Since only m attributes (e.g., three) are 
randomly selected to decide the split dimension, an attribute 
with overall low importance can be occasionally selected to 
split a node. By omitting attributes with lower importance, 
we force the trees to be built from attributes with greater 
information content, thereby improving the accuracy of the 
prediction. 

Another interesting point is that the relative impor- 
tance for both of the mentioned colors remain consistent, 
independent of redshift, while the other two colors show 
variation (i.e., u — g and r — i exchange importance rat- 
ings more than once), although they are overall consistent 
with each other. This behavior is mainly due to important 
spectral features, such as the 4000 A break, passing between 
different filters, which TPZ identifies algorithmically, as im- 
portant indicators of a galaxy's redshift. We see this result 
from another perspective in the central panel of Figure [7J 
which presents the RMS of the relative importance, sorted 
by their rank, for the four colors computed by using the ab- 
solute bias (blue line) and the variance (red line) . Both met- 
rics rank the attributes in the same order and either can be 
used to compute their importance to the data set. Perturb- 
ing the attributes produces a stronger effect on the absolute 
bias than on the scatter, mainly because when perturbing 
one dimension, we lose information and thereby increase the 
likelihood that a galaxy will end up in a random branch of 
the tree, especially for an important attribute. This would 
likely lead to a misclassification, which directly affects the 
mean absolute bias. 



Relative Importance 

The importance rank can also be used to better understand 
the training data, to check whether it is possible to reduce 
the dimensionality of the problem, and to identify areas of 
the mapped parameter space where new training data can 
be most effectively incorporated. This latter point can be 
accomplished by identifying the leaf nodes, and the galaxies 
contained therein, for each tree and computing their accu- 
racy on predicting for the OOB data along with their prox- 
imity matrices. By averaging over these results for all trees, 
we obtain the desired result. 

For example, by using the two most important at- 
tributes previously identified for the SDSS MGS (g — r, and 
u — g), we present a heat map in the right panel of Figure [7] 
that encodes the binned performance of these two attributes, 
where higher values indicate lower predictive success in that 
bin. In this plot, we see there are a few bins where perfor- 
mance is markedly lower (blue and light blue squares), and 
several areas that are lower than average (the yellow bins). 
On the other hand, there are two areas where the predic- 
tive power is quite high (deep orange-red), which are likely 
the result of the known color bi-modality of SDSS galax- 
ies ( Strateva et al.|2001 ) where early-type galaxies lie in the 
upper right part of this plot and late-type galaxies lie in the 
bottom left part of this plot. The areas in this heat map 
where the predictive performance is low can be caused by 
either a lack of training data, by galaxies with color degen- 
eracies, or by galaxies with higher than normal magnitude 
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Figure 7. (Left): The attribute importance factor as a function of redshift for the four attributes (SDSS colors) used in this analysis 
for the bias only. This factor quantifies how much the metrics decrease as we permute the attributes one at a time. (Central): RMS of 
the relative importance factor as a function of the attributes computed by using the bias (blue) and the scatter( red). (Right): A heat 
map constructed by using the two most important attributes, which indicates areas of parameter space where the photo-z prediction is 
poor. The higher the value (i.e., bluer) in a region, the more training data are needed to increase the accuracy of photo-z estimation 
within that region. These zones might also contain outliers or galaxies with bad photometry. 



errors. As a result, these areas can be prioritized for follow- 
up observations to improve the performance of the photo-z 
estimation. 



Identifying new training data 

Previously, we had stated that the relative importance of 
the different attributes, graphically shown in the heat map 
in the right panel of Figure [7] could be used to optimally 
identify new training data. We test this assumption by first 
randomly selecting 1,000 galaxies as our training set, in or- 
der to simulate a poor training set, so that we can quantify 
the effects of both randomly adding new data and selectively 
adding new data by using the relative importance. We per- 
form this test by first adding 1,000 new galaxies and second 
by adding 2,000 galaxies and computing the mean normal- 
ized bias, defined as Az' = (z spcc — z phot )/(l + z spcc ), and 
its standard deviation as we change the size of the training 
set by using the four color attributes from the SDSS MGS 
and and a forest with 100 prediction trees. 

We summarize these test results in Table [2] As shown 
in the table, selecting galaxies from those zones with lower 
accuracy as indicated by the heat map produces more ac- 
curate predictions than adding galaxies randomly. In fact, 
even adding 1,000 galaxies by using the heat map produces a 
slightly better performance than adding 2,000 galaxies ran- 
domly. These results indicate that it is more important to 
selectively add galaxies to areas where the prediction is poor 
than to simply increase the size of the training set. 

We continue this process, by continually adding either 
1,000 or 2,000 new galaxies to the training set. As the bot- 
tom panel of Figure[9]for the SDSS MGS demonstrates, after 
about 5,000 galaxies (or at half the size of our full training 
set), the performance metric shows little variation, which is 
also reflected in the last row of Table[2]where the metrics for 
the 15,000 galaxy training set are presented for comparison. 
This test demonstrates how current and future photomet- 
ric surveys can optimally construct training sets by either 
selectively using existing observations, or by obtaining new 



Number of training galaxies 


< Az' > 




1,000 


-0.0043 


0.042 


1,000 + 1,000 from random 


-0.0037 


0.038 


1,000 + 1,000 from map 


-0.0033 


0.032 


1,000 + 2,000 from random 


-0.0034 


0.036 


1,000 + 2,000 from map 


-0.0022 


0.025 


15,000 


-0.0018 


0.021 



Table 2. A comparison of the performance of TPZ for the SDSS 
MGS when extra data are added to the training set either ran- 
domly or by selectively using ancillary information. 

spectroscopic observations to improve the photo-z estima- 
tion. 

Error distribution 

After applying TPZ to the SDSS MGS, we can estimate 
photo-z errors directly from the estimated PDF by comput- 
ing either the mean, the mode, or some other statistic from 
this distribution. As a demonstration, we calculate the er- 
ror (768 as the region of the photo-z PDF centered on the 
mean that contains 68% of the cumulative probability. We 
next calculate the distribution of these standard errors by 
computing (z p hot — z spec )/o"68 for each PDF, which is shown 
as the black points in Figure [8] For unbiased standard error 
estimates, this distribution should be normally distributed 
with zero mean and unit variance. When we fit our measured 
points, we obtain a Gaussian with mean equal to 0.112 and 
a width of 0.949, which is shown by the solid green curve. 

This simple error estimate is quite close to the unbiased 
expectation, which is as we would expect for any reliable 
technique. The fit is not a perfect Gaussian due to a slightly 
extended tail on the left hand side of the distribution. We 
interpret this as a manifestation of the very narrow PDFs 
we have obtained and that the SDSS MGS is concentrated 
at lower redshifts where most photo-z techniques suffer from 
a small tendency to over-predict the photometric redshifts, 
as shown in the left panel of Figure [4] 
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Figure 8. The photometric standardized error, (z p hot ~ 
2spec)/o"68! for the MGS galaxies (black dots) using the mean 
of each individual PDF and the best fit Gaussian with /i = 0.112 
and a = 0.949 (solid green curve). 

Size of forest 

When we construct a forest for prediction, one parameter 
that must be specified is the number of trees that should be 
constructed. This is important as the more trees in the for- 
est, the higher the computational demands, which slows the 
training process and construction of photo- z PDFs. Thus, we 
test the performance of TPZ for the SDSS MGS by varying 
the number of trees built for our forest for a fixed-size train- 
ing sample. As before, we compute the mean of the absolute 
bias and its standard deviation, and present how these quan- 
tities vary as the number of trees in our forest changes for a 
fixed training size of 10,000 galaxies. 

These results are presented in the top panel of Figure [5J 
which shows that our algorithm does become more accurate 
as the number of trees increases. However, after around 100 
trees, the predictive power of the forest shows little varia- 
tion, indicating that this is a reasonable number of trees for 
this prediction process. |Breim an (2001) demonstrated that, 
as the number of trees in a random forest increases, any 
margin function will converge to a limit value. Thus, as ex- 
pected, we see our generalized error value converging. As a 
result, this implies that our technique does not over- fit the 
data as more trees are added in comparison to other machine 
learning methods. 



Training size 

Once we know the optimal number of trees that must be 
built for our forest, we next need to know the optimal size 
of our training set. By using 100 trees (as determined in 
the previous section), we vary the size of our training set 
and present the results in the bottom panel of Figure [9] 
As shown in this figure, the accuracy of TPZ for predict- 
ing photo- z does not change significantly after using around 
70% of the galaxies. This is an interesting result, that our 
approach quantifies in an elegant manner, but which will 
obviously vary between different data sets. Fundamentally, 
as the training set increases in size, the prediction accuracy 
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Figure 9. The absolute mean normalized bias denned as |Az'| = 
|(zspec — Zphot) 1/(1 + Zspcc), and its scatter as a function of the 
number of trees in the forest, keeping the training set fixed (top). 
The same two values as a function of the size of the training set 
keeping the number of trees fixed at 100 for galaxies in the SDSS 
MGS (bottom). 



also increases until most of the multi-dimensional parame- 
ter space has been sampled and little extra information is 
added by new training galaxies. 

Of course in this test we have not used the relative im- 
portance of our parameter attributes, as shown, for exam- 
ple, in the central panel of Figure [7] By manually selecting 
additional data, we should be able to reduce the values of 
these metrics significantly, which is discussed in the next 
section. But even in our current approach, we expect that 
some of our test data are not well represented in our train- 
ing set, which will limit the accuracy of this approach. We 
see this as an opportunity, however, as we can compute a 
cross-data proximity matrix by using the trained forest to 
identify galaxies within the test data that are isolated with 
few neighbors in the parameter space. Once identified, these 
galaxies could be treated individually by using, for exam- 
ple, other photo- z estimation techniques (see, e.g., Carrasco 
Kind & Brunner 2013, in preparation). 



4.2 PHAT1 blind test 

We also tested TPZ on the PHAT1 dataset, described in 
§3.2| which is a blind contest where the test spectroscopic 
redshifts are unknown to the competitors. Therefore, this 
provides a reliable method to compare the performance of 
different photo- z techniques. In this contest, only a limited 
quantity of training data are provided; we have approxi- 
mately 500 galaxies to train our algorithm for the approxi- 
mately 1,500 galaxies that form the validation sample. These 
data also have a sparse redshift distribution, extending from 
z ~ to z ~ 6. Despite these limitations, we applied TPZ 
to this training data, submitted our results to the contest, 
and obtained the resulting performance metrics from the 
PHAT leader (H. Hildebrandt, private communication). We 
present our specific results in Table |3j which can be com- 
pared directly with the results shown in Table 5 of the PHAT 
paper ( |Hildebrandt et al.||2010 i. 
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Table 3. The TPZ results for the PHAT1 catalogue both with 
and without the IRAC bands, and for all galaxies and for a 
magnitude-limited sample with R < 24. Note that these are the 
same statistics presented in Table 5 of Hildebrandt et al. (2010) 
for other photo-z estimation techniques. 



Run 



bias a 



scatte 



f 



outlier rate' 



18-band 
14-band 

18-band R < 24 
14-band R < 24 



-0.002 
-0.007 
-0.004 
-0.009 



0.055 
0.055 
0.055 
0.054 



14.1 % 
12.6 % 
11.1 % 

9.6 % 



a bias is defined as: Az' = Zap , a ° Zphot 

l + Zspec 

b RMS of the bias Az' 

c Outliers are defined as objects with \Az'\ > 0.15. 




We computed validation results for four different pho- 
tometric samples: by using all eighteen photometric bands, 
by omitting the Spitzer photometry and using only four- 
teen photometric bands, and by creating magnitude limited 
(R < 24) for each of these two galaxy samples. For these 
validation runs, we use the regression mode to create a for- 
est of 150 trees with m, = 4 (as described in j j2.2| . In all 
runs, we made no cuts on the zConf parameter so that we 
could more directly compare our results to the other com- 
petitors. In the end, the TPZ results are among the most 
accurate photo-z predictions, especially when compared to 
other empirical training codes. Interestingly enough, TPZ 
even outperforms some template photo-z techniques, which 
are supposedly better suited for this particular challenge due 
to the dearth of training data and large redshift range cov- 
ered by the validation sample. These results show that even 
in less than ideal conditions, TPZ provides a robust estima- 
tion of photometric redshifts. Note that due to the lack of 
training data and the extended redshift distribution of the 
validation sample, we did not generate ancillary information 
for the data by using the OOB approach. 

DEEP2 

We have also tested TPZ by using the DEEP2 redshift sur- 
vey data, which extends to much higher redshifts than the 
SDSS MGS. As described in |3.3| we treat the galaxies with 
CFHTLS photometry independently from those with SDSS 
photometry, but in the end we merge the photo-z results. 
We follow a similar analysis to what we used with the SDSS 
MGS, and after we compute the photo-z PDFs, we select 
only those galaxies with zConf > 0.7, which includes about 
81% of the galaxies. We have that the average bias, using 



Az' = (z 



-*pbot)/(l + 3s P ec), is -0.007 with a Az , = 0.059 



and a outlier rate, defined as |Az'| > 0.15 = 2.9%. We know 
of no previous photo-z analyses of these data (described 
in j ]3.3[ ) with which to compare these results. The results 
are presented in Figure fTo] which compares the photo-z com- 
puted by using the mean of each individual PDF with the 
spectroscopic redshift for the 7,856 galaxies. In this figure, 
we also compute the median, shown by the black dots, and 
the tenth and ninetieth percentiles, shown by the black error 
bars, within spectroscopic bins of width Az = 0.1. 

As this figure demonstrates, we see consistent results 



Figure 10. The TPZ photo-z with zConf > 0.7 versus the spec- 
troscopic redshifts for 7,856 galaxies selected from the DEEP2 
redshift survey. The black dots are the median values of z p hot 
and the errors bars correspond to the tenth and ninetieth per- 
centiles within a given spectroscopic bin of width Az = 0.1. 



across all redshifts, and both the isodensity contours and 
the errors bars indicate that there are few outliers or catas- 
trophic photo-z. However, at both ends of the distribution, 
we see several bins that show that the photo-z results are 
less accurate and are systematically higher for the first two 
bins and systematically lower for the last two bins. This 
effect is often seen with empirical techniques, as the spec- 
troscopic training samples are often less complete at these 
redshifts, see, e.g., the redshift distribution in Figure |15| 
Another effect causing this skewness is that estimated pho- 
tometric redshifts can not be negative, thus our probability 
distribution can not be symmetrical at the low redshift end. 
Another possible explanation for the low redshift systematic 
is the effect of galaxy inclination and the induced extinction 



on photo-z prediction as shown recently by Yip et al. ( 2011 1 



Likewise, the systematic underestimation at higher red- 
shifts is likely affected by the fact that many of these galaxies 
are near the limit of the photometry and thus have higher 
than average magnitude errors. In combination with the 
lower density of training data, this will reduce the efficacy 
of our photo-z technique. To understand this effect, recall 
that our trees are built from objects whose photometry is 
sampled by assuming a normal distribution defined by the 
magnitude and magnitude error from the bootstrap sam- 
ples. As the magnitude error increases, the range of possible 
values to sample increases, thereby producing a sparser sam- 
pling for this galaxy within our forest. Since there are few 
galaxies with redshifts above 1.3 in the training data, the 
branches on the forest for high-z galaxies are mainly domi- 
nated by training galaxies with redshifts closer to 1. As we 
build the PDF for the high-z galaxies, the PDFs will be pos- 
itive skewed, and thus the mean value of each PDF will tend 
to be at lower redshift values. 

We demonstrate this skewness in Figure [TT] which 
shows the average skewness of the photo-z PDFs and the 
one-sigma error as a function of the spectroscopic redshift. 
These two quantities are computed as the third standardized 
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Table 4. A comparison in the accuracy of photo-z predication 
by using different attribute combinations from the DEEP2 data 
for all test galaxies. The first row are the metrics for TPZ using 
only OOB data, which are comparable to the values obtained 
from the full test data, shown in the second row. The remaining 
rows provide these metrics for training data that have had the 
indicated number of attributes removed from the calculation. 



Attribute Selection 


< \Az'\ > 




All attributes (OOB metrics) 


0.052 


0.053 


All attributes 


0.047 


0.049 


Remove 2 least important 


0.044 


0.046 


Remove 2 most important 


0.061 


0.068 


Remove 4 least important 


0.044 


0.048 


Remove 4 most important 


0.070 


0.084 



Figure 11. The skewness of the photo-z PDF as a function of 
spectroscopic redshift. The solid black line is the mean of the 
skewness and the pink shaded region corresponds to the one-cr 
interval. Positive skewness indicate a PDF skewed to lower red- 
shifts. 

moment: 

S k = j \{z)dz (8) 

with, 

z = J zp(z)dz and, cr z = J (z — z) 2 p(z)dz (9) 

where the integrals are computed over the redshift domain, 
and p(z) is the photo-z PDF. We can see that for redshifts 
up to 1.1 the average skewness is very close to zero, showing 
a small trend to negative values, which will, on average, pro- 
duce lower values for the mean photo-z. At higher redshifts, 
however, there is a clear increase in the average skewness, 
which will tend to produce lower values for the mean of the 
PDF. It is important to note that even though these PDFs 
may be (slightly) skewed, they still predict sufficient proba- 
bility near the true redshift, information that is overlooked 
by other methods that use one point predictions. On the 
other hand, a catastrophic photo-z would have a symmetric 
PDF centered near the wrong redshift, which is not what we 
observe here. 

Relative Importance 

By using OOB data, we have computed the metrics from the 
training data that we compare in Table [4] to the metrics we 
obtained from the test data after the photo-z distributions 
were computed. The first two rows of this table show the 
complete results for all attributes for the DEEP2 galaxies. 
From this we see that there is strong agreement between 
the OOB and test data results for both the bias and the 
variance. We also computed the relative importance for the 
eight photometric bands and the RG attribute, which is the 
estimated R-band radius of an object in 0.207" pixels (i.e., 
the sigma of the Gaussian fit to the light distribution). 

In the left panel of Figure |12| we present the attribute 
importance factor as a function of redshift for the three most 
and the two least important attributes. From this figure, we 



see that the R band and the r band are the most important 
attributes for making a photo-z prediction, similar to the 
g — r color for the SDSS MGS. This demonstrates that by a 
pure statistical analysis, in the optical regime, the R band is 
the most effective attribute. These attributes show a peak in 
their importance between redshifts of 0.3 to 0.5. We interpret 
this increase to the presence of the 4000Abreak being located 
at these redshifts within these filters. Likewise, the next two 
most important attributes are the i band and the z band, 
which are likely important for the same reason, albeit over 
a slightly higher redshift range. 

On the other hand, the least two important attributes 
are the B band and the RG attribute. As shown in this fig- 
ure, the RG attribute does not contribute to the photo-z 
prediction, instead acting like a random variable and thus is 
likely introducing extra noise into the calculation. We also 
see no clear evidence that the effect of this attribute changes 
with redshift. At low redshifts, this attribute could be af- 
fected by inclination angle or spectral type, while at higher 
redshifts, galaxies tend to be fainter and thus have smaller 
angular sizes. Presumably, these cumulative effects combine 
to erase any important information this attribute might pro- 
vide to the photo-z calculation. 

In the right panel of Figure |12| we present the mean 
relative importance for each attribute computed from the 
changes in the mean (blue circles) and the scatter (red 
squares) when this attributed is permuted, similar to the 
central panel of Figure [7] Once again, both metrics agree 
with the importance ranking. In order to characterize the at- 
tributes and their computed ranking of importance, we have 
made the following tests. First, we removed the two least im- 
portant attributes, after which we remove the two most im- 
portant attributes, reincorporating the previously removed 
attributes. We repeat this process, but now we remove al- 
ternately the four least and most important attributes. In 
each case, we test whether TPZ is able to correctly recognize 
the attribute importances. These results are summarized in 
Table [4] where we use the absolute mean value of Az' and 
its dispersion. 

As is not surprising, we see that removing the two least 
important attributes does, in effect, improve the precision 
of TPZ while also making the code run faster since we have 
fewer dimensions to check when splitting nodes within the 
tree, less data to keep in memory when building the tree 
thus improving cache access, and random realizations from 
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Figure 12. (Left): The variable importance factor, Ia, as a function of redshift for the three most and the two least important attributes 
(i.e., DEEP2 magnitudes) using the bias to quantify this importance. This index specifies how important an attribute is to the calculation 
of a metric when we permute the attributes one at a time. (Right): The RMS of the attribute importance factor as a function of the 
attributes computed by using the bias (blue) and its scatter (red). Both of these metrics capture the same relative attribute importance. 



the input parameters will be faster since there are fewer di- 
mensions to sample. Yet, removing four attributes shows a 
slight decrease in the overall performance, in this case we 
have removed too much information. While this decrease 
might seem rather small, since we are randomly selecting 
attributes when splitting nodes within the trees, by remov- 
ing four we have increased the scatter since we are losing 
information. On the other hand, removing the most impor- 
tant attributes significantly affects the results, regardless of 
how many attributes we remove. As we would expect, the 
reason is clear. Since these attributes have the most infor- 
mation needed to subdivide the multidimensional parameter 
space in order to produce accurate photo-z, removing them 
negatively impacts the performance of TPZ. 

In a further control test we added two extra artificial 
variables to the data set, one of which is strongly dependent 
on the source redshift, i.e. a function of redshift, while the 
second one is a uniformly distributed random variable. After 
computing their importance rankings, we can see from Fig- 
ure [13] that TPZ recognized these two extra attributes and 
put them on the extreme limits of the importance ranking. 
The most important value ranks at about eight, the random 
variable ranks at one as expected, while the r magnitude is 
ranked with a value close to two. We notice that the variable 
RG is very close to one, and therefore to a random variable. 
As discussed above, we can safely remove this variable from 
our calculation as it does not provide any useful informa- 
tion. The legend on the plot indicates also the descending 
order in importance, in concordance with Figure fl~2] 

Missing data 

One interesting capability of TPZ is that it can be used to 
replace attributes in data that are either missing attributes 
or have attributes with large uncertainties. As discussed in 
§2.2. 1| the replacement values can be computed from the 
proximity matrix, and we can apply this technique to data 
either in the training sample or in the application sample. In 
the former case, missing attributes would be replaced in or- 



der to maximize the size of the training set. The alternative 
would be to simply cull data with missing attributes from 
the training sample, which would decrease the robustness of 
our predictive power. In the latter case, missing attributes 
would be replaced in order to estimate a photometric red- 
shift for a galaxy based on the incomplete but available in- 
formation. In most cases, this will still result in a reliable 
prediction, without discarding any data, thereby increasing 
the overall statistical power of our approach. 

To demonstrate this capability, we selected training and 
testing data sets that initially were complete and had rela- 
tively small errors (i.e., magnitude errors < 1 magnitude). 
We first randomly replaced 50% of the magnitudes in the 
training data with a bad value (e.g., 99), thus some galaxies 
in this sample have multiple bad attributes. From this new 
data set, we apply TPZ to generate a second training sam- 
ple where the bad attributes have been replaced, using only 
six iterations (i.e, the replaced sample), and we also gen- 
erate a third training sample where we simply remove any 
galaxies with missing or bad attributes (i.e., the cut sam- 
ple). Likewise, we also generated a test sample with 50% of 
the attributes replaced by the bad value (i.e., the bad test 
sample) 

We estimate photo-zs for the clean test sample by 
using all three training samples: the original, clean sam- 
ple (i.e., the control), the replaced attribute sample, and 
the cut sample. Likewise, we use the clean training sam- 
ple to replace missing attributes and estimate photo-zs for 
the bad test sample. We present the results of these tests 
in Table [5j where we compare the photo-z estimation for 
the clean sample with the replaced and cut samples. For 
this comparison, we use Az pp = z photiC i can - z ph ot,othor and 
Amag — ma(? c i oan — rnag thav, along with their variances, 
where other can either be the replaced or cut samples. As 
shown in this Table, the replaced value sample produces, on 
average, superior photo-zs than the cut sample. Likewise, 
we have estimated robust photo-zs for the bad test sam- 
ple, which significantly increases the size of our resulting 
test data. Dealing with missing attributes is important , es- 
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Table 5. Photo-z estimation metrics to demonstrate the robust- 
ness of our missing attribute technique. The first two rows show 
the average bias and its variance between the estimated photo-z , 
and replaced magnitude when either removing or recovering bad 
data in comparison with photo-zs predicted using the original, 
clean sample. The last row shows the the same metrics calculated 
by using the clean test sample, but for data missing in the test 
sample as compared to the clean original sample. 
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Figure 13. The variable importance factor, 7^4, as a function 
of redshift for the most and least important attribute using the 
bias to quantify the importance ranking. As a control test, we 
added two artificial variables: an attribute that is a function of 
the spectroscopic redshift, and a uniformly distributed random 
attribute. TPZ is able to recognize these two extra attributes and 
rank them accordingly, as shown by the figure legend. 



pecially when a spectroscopic training sample is limited or 
when cross-matching between incomplete catalogs is carried 
out in order to develop a more complete catalog for photo-z 
estimation. 



Photo-z PDFs and zConf 



As discussed in §4.1| the zConf parameter can be used to 
identify galaxies with narrow, concentrated photo-z PDFs, 
which ideally will result in galaxy samples that have the 
most accurate photo-z estimates. The zConf parameter is 
demonstrated for DEEP2 galaxies in the left panel of Fig- 
ure |14| which shows four representative photo-z PDFs se- 
lected with different values of zConf as measured about 
the mean of each PDF. Both this figure and Figure [5] which 



presents four photo-z PDFs by using SDSS data, highlight 
the fact that wide and sparse distributions have low zConf 
values while narrower PDFs have higher zConf values. 

The goal of a parameter like zConf is to algorithmi- 
cally identify galaxies that have, on average, the most accu- 
rate photo-z estimates. To test this hypothesis, we used all 
available DEEP2 training data to bud our prediction trees 
and estimate photo-zs for the DEEP2 test sample. From this 
sample, we applied three zConf cuts: 0.5, 0.7, and 0.9, and 
calculated the bias and scatter as a function of redshift for 
the three resulting galaxy samples. We compare these re- 
sults to the bias and scatter when no zConf cut is applied 
in the right hand panel of Figure [14] As shown in this figure, 
both the mean absolute bias and the scatter are reduced as 
zConf is increased, independent of redshift. 

A simple, intuitive approach to select galaxies by their 
zConf would be 0.5 as this selects galaxies that have a 50% 
probability that their photo-z redshift estimate lies within 
the limits imposed by ±<TTPz(l+z p hot)- Furthermore, higher 
values would provide more accurate results at the expense 
of reduced statistical power (i.e., a smaller, final catalog). 
In Figure |14| for example, cuts on zCon f at 0.5, 0.7 and 
0.9 keep 90%, 76% and 38% of the galaxies from the orig- 
inal catalog. Alternatively, given the OOB data predictive 
results, a required accuracy or number density can be used 
to identify a suitable value of zConf. 

N(z): An application of photo-z PDFs 

Most of the results we have presented within this paper have 
been based on the estimation of a single metric computed 
from the photo-z PDF, for example the mean or mode. Obvi- 
ously, using a single value to represent the PDF wastes sig- 
nificant information, but since many photo-z applications 
mimic spectroscopic redshift applications, new approaches 
must be developed to capitalize on the full information con- 
tent of a photo-z PDF. As a result, we present a simple, 
yet very important application that uses the full photo- 
z PDFs — estimating the galaxy redshift distribution, N(z). 
This function is a fundamental measurement and is very 
important to a number of cosmological applications includ- 
ing weak lensing tomography (e.g., Mandelbaum et al. 2008 
|Jee et al.|2012[ ) and projecting three-dimensional theoretical 
power spectra to angular clustering measurements (Hayes 
et al.|2012 i 



We compute the normalized galaxy redshift distribu- 
tion, N(z), for all the galaxies in DEEP2 sample (i.e., no 
DEEP2 redshift confidence selection cut was applied), shown 
in Figure [15] as the shaded gray area. As demonstrated by 
this figure, in this spectroscopic survey, most galaxies were 
selected to have redshifts between 0.6 and 1.2. Next, we com- 
pute the binned photometric redshift distribution by using 
the mean value from each photo-z PDF, shown by the red 
curve. While this curve does trace the gross features of the 
underlying spectroscopic redshift distribution, it fails to cap- 
ture the full detail and can be significantly different at cer- 
tain redshifts, including at the mode. For comparison, we 
show in black the photo-z PDF redshift distribution that 
we obtain by simply stacking the individual PDFs together. 
With this simple approach, we obtain a more accurate rep- 
resentation of the true sample redshift distribution. Here we 
have used all the galaxies, without selecting galaxies by their 



18 M. Carrasco Kind and R. J. Brunner 




Figure 14. (Left): Same as Figure[5]but for four example galaxies taken from DEEP2. The vertical dashed line indicates the spectroscopic 
redshift and the gray area the zConf value. (Right): The absolute normalized bias and the scatter for galaxy samples defined by different 
zConf cuts by using the mean of the photo-z PDF as our estimate. 



confidence level. This demonstrates that all individual PDFs 
computed with TPZ carry important information about the 
underlying distribution. 

These differences are more clearly exposed in the bot- 
tom panel of Figure |15[ where we show the absolute frac- 
tional error, (N p h ot — N spRC )/N avec , as a function of redshift, 
using the same color scheme as before. From this figure, we 
see that the stacked PDF has a smaller error for almost all 
redshifts. In addition, the photo-z PDF redshift distribu- 
tion is considerably smoother and looks more like a fit to 
the spectroscopic sample, which is another benefit of using 
the full photo- z PDF. For this particular demonstration, the 
photo-z PDF presented used a bin size of 0.002, while the 
spectroscopic and photometric redshift distributions used a 
bin size of 0.03. Of course, we can generate smoother distri- 
butions for either the spectroscopic or photo-z mean value 
redshift distributions by reducing the bin size, however, the 
trade off is that we run the risk of increasing the shot noise 
in the resulting distribution. 



5 CONCLUSIONS 

In this work we have presented and publicly releas^TPZ, 
a new, parallel, machine learning photo-z Python code that 
computes photo-z PDFs and also provides ancillary infor- 
mation about the photometric data. TPZ is based on the 
construction of prediction trees and consequently a random 
forest. Overall, TPZ is a three step algorithm that first pre- 
processes the data, completes galaxies with missing photo- 
metric values in an efficient manner, and also incorporates 
measurements errors. A photo-z PDF can be generated from 
the prediction trees in one of two modes: classification or re- 
gression. Both modes produces similar accuracies, but the 
regression mode is preferred when either the training data 
are either poorly sampled or not uniformly distributed. On 
the other hand, the classification mode provides a detailed 
synopsis of the redshift distribution that can be used to con- 
struct priors for use with other photo-z techniques. 

B http:/ /ledm. astro. illinois.edu/research/TPZ. html 



We demonstrated the efficacy of the TPZ algorithm and 
its implementation by applying this new code to three dif- 
ferent data sets: the SDSS main galaxy sample, the PHAT1 
blind challenge, and the DEEP2 survey. With the SDSS 
MGS, we demonstrated that using confidence levels is im- 
portant as they improve the overall accuracy of our photo-z 
sample by selecting those galaxies with narrow PDFs. This 
technique is unique in the sense that it does not need a 
separate validation test, yet provides ancillary information 
by using OOB data. We have shown that with these data, 
we obtain unbiased estimates of both the bias and the dis- 
persion, which are very similar to the same values obtained 
from the test data for both the SDSS MGS and DEEP2. 
Obviously, this result is extremely important when working 
with data that have unknown redshifts. 

TPZ not only provides these prior metrics, but it also 
provides a ranking of the relative importance of the dif- 
ferent photometric attributes that are used by the code. 
This completely statistical process recovers what is natu- 
rally expected from physical consideration of these different 
attributes. With this importance ranking, we can construct 
a heat map of the different locations in parameter space that 
produce poor photo-z estimations. Furthermore, we demon- 
strated that by adding new, manually selected data we can 
produce more accurate photo-z predication than by simply 
adding new galaxies randomly. This implies that we can op- 
timally identify new training data for current and future 
photometric surveys, such as DES or LSST, in order to im- 
prove their photo-z predictions. 

The attribute importance can also be used to remove 
those attributes that are least important, thereby improving 
the computational speed. In addition, we demonstrated that 
the performance metrics converge as the number of trees in- 
creases in the forest, providing a further method to reduce 
the computational time since we have a direct measure of 
the minimum forest size. Likewise, we also demonstrated by 
using the SDSS MGS that these same metrics also converge 
with the number of training galaxies for a fixed forest size. 
Thus, except for adding in manually selected training data 
to improve areas with poor photo-z prediction, we have an 
explicit limit for the number of training galaxies needed. 
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Figure 15. (Top): The redshift distribution for the all DEEP2 
spectroscopic sample of galaxies (shaded gray histogram), com- 
puted from the mean value of individual photo-z PDFs (red 
curve), and computed by stacking individual photo-z PDFs (black 
curve). (Bottom): The residual absolute error between the spec- 
troscopic redshift distribution and the two photo-z redshift dis- 
tributions shown using the same color scheme. 



Finally, with this technique we found that the error distri- 
bution was characterized by a Gaussian distribution with 
a mean very close to zero and variance very close to one, 
indicating that the source of errors is relatively unbiased. 

We ran our code on the PHAT1 blind challenge with 
excellent results; even with limited training data we were 
able to compute accurate photo-z 's that were compara- 
ble if not better to other empirical techniques as well as to 
some SED fitting techniques. By using the DEEP2 redshift 
data, we tested TPZ over a large redshift range, obtaining 
very accurate results. In particular, we were able to iden- 
tify the important attributes, which in this case was the 
R band magnitude followed by the I band magnitude, and 
the least important attributes, which in this case was the 
RG attribute and the B band magnitude. Despite these im- 
pressive results, we still have a slight systematically biased 
photo-z at very low and very high redshifts, which we pri- 
marily believe is caused by the low number of training data 
at these redshifts and also the fact that photo-z estimates 
can not be negative. We also see a positive skewness in the 
photo-z PDFs at high redshifts. We believe this result is due 
to the fact that these galaxies tend to be fainter and have 
larger magnitude errors. These larger magnitude errors pro- 
duce a sparser forest at higher redshifts, which is manifested 
by having a lower photo-z PDF mean value at these same 
redshifts. 

We have also demonstrated how the zConf parameter 
can be used to select galaxy samples that have improved 
photo-z estimates with minimal outliers. A target value for 
this useful parameter can be set to a desired photo-z preci- 
sion either by calculating the value expected by using OOB 
data or as required by a specific cosmological requirement. 
Likewise, we have demonstrated how TPZ can efficiently 
handle missing data within a catalog. By artificially gen- 
erating bad or missing parameter values within both the 
training and the testing data sets, we were not only able 



to robustly recover the missing parameters but more impor- 
tantly new photo-z estimates that are consistent with the 
photo-z estimates from the original, full data set. Therefore, 
this technique increases the power of photo-z estimation by 
recovering missing data from the training catalog as well as 
the power of our resulting sample statistics by recovering 
missing data from the application data set. 

Finally, by calculating the normalized distribution of 
galaxies as a function of redshift, we were able to demon- 
strate the advantages of using a full photo-z PDF as opposed 
to using one single estimator of the PDF or any other point 
metric. Specifically, by simply stacking each individual PDF, 
we recover the underlying galaxy redshift distribution to a 
much higher precision than by simplifying using the mean 
of each individual photo-z PDF. 

In conclusion, we note that since TPZ is an empiri- 
cal algorithm, it is inherent dependent on the quality of its 
training data. Thus, as is the case with all empirical algo- 
rithms, TPZ is limited by the available spectroscopic train- 
ing data. Furthermore, the application of TPZ to regions of 
parameter space beyond the limits of the training data (i.e., 
extrapolation) will be less reliable. We do note, however, the 
TPZ does provide ancillary information that can be investi- 
gated to better understand the limitations imposed by the 
training set, to identify the optimal locations within the ap- 
plication data space where new training data will be most 
useful, and to quantify the possible errors associated with 
the extrapolation of this technique. Another approach to 
improve the efficacy of photo-z estimation beyond the lim- 
its of a spectroscopic training sample would be to include 
the predictions from different, non-empirical techniques into 
a meta-classifier. We will explore this approach in a future 
work on this topic. 
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